\section{Properties of our technique}

\subsection{Performance}

We compared the performance of our technique for the \texttt{find}
function shown in \refApp{app-example-implementation} to the
performance of the \texttt{find} function shipped with \sbcl{}.

We tested the performance of our technique only on \sbcl{} because it
is one of the few implementations that has a compiler that implements
all the optimizations that our technique requires in order to perform
well.  In \refSec{sec-conclusions} we discuss what these optimizations
are.

The results show a significant performance gain compared to the
\texttt{find} function of \sbcl{}.  In fact, as it turns out, the
\sbcl{} sequence functions often require the programmer to declare the
element type (when the sequence is a vector) in order for performance
to be improved.  We have not attempted to compare our technique to
this case, for the simple reason that one of the advantages of our
technique is precisely that no additional information is required in
order for performance to be acceptable.

Most of our tests use relatively fast \texttt{test} functions such as
\texttt{eq} or \texttt{eql}.  This is a deliberate choice, as we
want to compare the performance of the traversal of the sequence,
and a more time-consuming test function would dominate the execution
time.  Because of the inherent imprecision in timing results, using
more time-consuming test functions would require us to subtract two
large timing results, thereby making the difference even less
precise.

In this paper, we report only execution time, and no other performance
parameter such as memory use or amount of allocated memory.  In fact,
the only situation where memory consumption is more than a small
constant is when the stack is used as temporary storage when
\texttt{:from-end} is true and the sequence is a list.  This situation
is fully covered in our paper dedicated to processing lists in reverse
order \cite{Durand:2015:ELS:reverse}.  Similarly, there is no
allocation of memory involved in our technique.  Client-supplied
functions for tests and key computation may of course allocate memory.

\subsubsection{Results on vectors}

When the sequence is a vector, the main performance consideration has
to do with the element type of the vector.  The parameters
\texttt{start}, \texttt{end}, and \texttt{from-end} do not
significantly alter the way the traversal is implemented.  The
\texttt{key} function may influence performance, but for the cases
that we treat specially, only unspecialized vectors are concerned.

Our first test shows the performance comparison for an unspecialized
vector with the \texttt{key} function being \texttt{identity} and the
\texttt{test} function being \texttt{eq}.  As shown in the diagram
below, our function is around three times as fast as that of \sbcl{}.

\inputeps{VECTOR-EQ-SYMBOL.eps}

The next test uses a vector whose element type is
\texttt{character}.  The \texttt{key} function is \texttt{identity},
and the \texttt{test} function is \texttt{eql}.  As shown in the
diagram below, our function is around twice as fast as that of
\sbcl{}.

\inputeps{VECTOR-EQL-CHAR.eps}

The next test shows a vector with the element type being
\texttt{(unsigned-byte 8)} and the test function being \texttt{=}.  In
this case, our function is only around $20\%$ faster than that of
\sbcl{}.  This modest improvement can be explained by the fact that
this test function is not one of the functions that we treat in a
special way.  An implementation that wishes better performance for
this case can modify the macros to reflect this desire.

\inputeps{VECTOR-EGAL-UB8.eps}

For completeness, we finally show a test for a vector with the element
type being \texttt{bit}.  In this case, our technique is slow, because
it accesses the elements one at a time, whereas a good, native
implementation of \texttt{find} would use available processor
instructions to handle an entire word at a time
\cite{Baker:1990:EIB:121989.121991}.

\inputeps{VECTOR-EGAL-BIT.eps}

\subsubsection{Results on lists}

When the sequence is a list, the concept of element type is not
applicable.  The \texttt{key} function is important, because the
sequence functions may be used for association lists.  For that
reason, we include a test with \texttt{car} as a key function.  Also,
for lists, the parameter \texttt{end} may influence the performance.
In our implementation, we specialize the loops according to whether
this parameter is \texttt{nil} or a number, allowing for two different
specialized versions of the main traversal loop.

Our first test, like the first one on a vector, traverses a list of
symbols.  The test function is \texttt{eq}, and no \texttt{end}
parameter has been given, which is equivalent to giving it the value
\texttt{nil}.  Again, our implementation is around three times as fast
as that of \sbcl{}.

\inputeps{LIST-EQ-SYMBOL.eps}

In the next test, the sequence is a list containing only bignums.  The
\texttt{test} function is \texttt{=}.  As the diagram shows, our
technique is only moderately faster than that of \sbcl{}.

\inputeps{LIST-EGAL-BIGNUM.eps}

In the next test, the sequence is a list where each element is a pair
where the first element is a bignum, so that we can use \texttt{car}
as a \texttt{key} function.  The \texttt{test} function is still
\texttt{=}.  There seems to be no significant difference for this
case, compared to the previous one, suggesting that the native
implementation of \texttt{car} is so fast that calling \texttt{=} will
dominate the computation.

\inputeps{LIST-EGAL-CAR.eps}

In the next test, the sequence is a list of pairs of symbols; it uses
\texttt{car} as key function, \texttt{eq} as test function and a non
\texttt{nil} value for the \texttt{end} parameter. Our implementation
is at least twice as fast as that of \sbcl{}.

\inputeps{LIST-EQ-CAR-END.eps}

Our final test uses a non \texttt{nil} value for the \texttt{end}
parameter.  Despite the fact that we use a slightly more expensive
\texttt{test} function (namely \texttt{=}), the performance of our
implementation is very good; it is around three times as fast as that
of \sbcl{}.

\inputeps{LIST-EGAL-END.eps}

\subsection{Maintainability}

From the point of view of maintainability, there are clear advantages
to our technique.  With only a small amount of macro code, we are able
to hide the implementation details of the functions, without
sacrificing performance.

The small amount of macro code that is needed to make our technique
work is clearly offset by the considerable decrease in the code size
that would otherwise have been required in order to obtain similar
performance.

\subsection{Disadvantages}

There are not only advantages to our technique.

For one thing, compilation times are fairly long, for the simple
reason that the body of the function is duplicated a large number of
times.  Ultimately, the compiler eliminates most of the code, but
initially the code is fairly large.  And the compiler must do a
significant amount of work to determine what code can be eliminated.
To give an idea of the orders of magnitude, in order to obtain
fully-expanded code on \sbcl{}, we had to increase the inline limit from
$100$ to $10\thinspace000$, resulting in a compilation time of tens of
seconds for a single function.  For the case of a vector, the inner
loop will be replicated for each element type, then each of these
replicas will again be replicated for each special case of the
\texttt{test} and \texttt{test-not} functions, and again for each
special case of a \texttt{key} function, and finally for the two cases
of processing from the beginning or the end.  The resulting number of
replicas of the inner loop depends on the number of such special
cases, but can easily exceed $1000$.  Each replica will ultimately be
stripped down by the compiler because redundant tests will be
eliminated for each one.

Another disadvantage of our technique is that it doesn't lend itself
to applications using short sequences.  For such applications, it
would be advantageous to inline the sequence functions, but doing so
would make each call site suffer the same long compilation times as we
now observe for the ordinary callable functions.

Not all compilers are able to optimize the main body of a function
according to some enclosing condition.  For a \commonlisp{}
implementation with a more basic compiler, no performance improvement
would be observed.  In addition, the duplication of the main body of
the function would result in a very large increase of the code size,
compared to a simpler code with the same performance.

For the special case of bit vectors, our technique will not be able to
compete with a good native implementation of the sequence functions.
The reason is that, despite the optimizations that the compiler can
perform with our technique, the body of a typical sequence function
still consists of a loop where each iteration treats a single element.
A native implementation would not treat a single element in an
iteration.  Instead, it would take advantage of instructions that
exist in most processors for handling an entire \emph{word} at a time,
which on a modern processor translates to $64$ bits.  An
implementation that uses our technique would then typically handle bit
vectors as a special case, excluded from the general technique.
